An automated method to build a corpus of rhetorically-classified sentences in biomedical texts

Authors

  • Hospice Houngbo
  • Robert E. Mercer
Abstract

The rhetorical classification of sentences in biomedical texts is an important task in recognizing the components of a scientific argument. Generating supervised machine-learned models for this recognition requires corpora annotated for the rhetorical categories Introduction (or Background), Method, Result, and Discussion (or Conclusion). Currently, only a few small annotated corpora exist. We use a straightforward feature of co-referring text, the word “this”, to build a self-annotating corpus extracted from a large biomedical research paper dataset. The corpus is annotated for all of the rhetorical categories except Introduction without involving domain experts. In a 10-fold cross-validation, we report an overall F-score of 97% with Naïve Bayes and 98.7% with SVM, far above those previously reported.
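
The self-annotation idea in the abstract can be illustrated with a minimal sketch. It assumes that a sentence containing “this” followed by a category-bearing noun (for example “this method” or “this result”) can be labeled with the matching rhetorical category. The cue nouns in `CUE_NOUNS`, the function name `self_annotate`, and the choice to label the matching sentence itself (rather than a neighboring sentence it co-refers to) are illustrative assumptions; the abstract does not specify them.

```python
import re

# Hypothetical cue nouns per rhetorical category. The abstract does not list
# the actual lexicon, so these are illustrative assumptions only.
CUE_NOUNS = {
    "Method": ("method", "approach", "procedure", "technique"),
    "Result": ("result", "finding", "observation"),
    "Discussion": ("conclusion", "suggestion", "implication"),
}

# Reverse lookup: cue noun -> category label.
NOUN_TO_LABEL = {
    noun: label for label, nouns in CUE_NOUNS.items() for noun in nouns
}

# Matches "this <cue noun>" as whole words, case-insensitively.
PATTERN = re.compile(
    r"\bthis\s+(" + "|".join(NOUN_TO_LABEL) + r")\b", re.IGNORECASE
)


def self_annotate(sentences):
    """Return (sentence, label) pairs for sentences containing a cue phrase.

    Sentences without a cue phrase are skipped, so the output is a
    self-labeled subset of the input rather than a full annotation.
    """
    labeled = []
    for sentence in sentences:
        match = PATTERN.search(sentence)
        if match:
            labeled.append((sentence, NOUN_TO_LABEL[match.group(1).lower()]))
    return labeled


if __name__ == "__main__":
    sample = [
        "This method was applied to all samples.",
        "Cells were grown overnight.",
        "This result indicates a strong effect.",
    ]
    for sentence, label in self_annotate(sample):
        print(label, "|", sentence)
```

Pairs produced this way could then serve as training data for a classifier such as Naïve Bayes or an SVM, evaluated with 10-fold cross-validation as the abstract reports. No Introduction cue is included, consistent with the abstract's statement that the corpus covers every category except Introduction.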

Similar resources

Automated Construction and Evaluation of Japanese Web-based Reference Corpora

A particularly promising approach to the use of the Web for linguistic research is to build corpora via automated queries to search engines, retrieving and post-processing the pages found in this way (Ghani et al. 2003, Baroni and Bernardini 2004, Sharoff to appear). This approach differs from the traditional method of corpus construction, where one needs to spend considerable time finding and ...

PaCCSS-IT: A Parallel Corpus of Complex-Simple Sentences for Automatic Text Simplification

In this paper we present PaCCSS–IT, a Parallel Corpus of Complex–Simple Sentences for ITalian. To build the resource we develop a new method for automatically acquiring a corpus of complex–simple paired sentences able to intercept structural transformations and particularly suitable for text simplification. The method requires a wide amount of texts that can be easily extracted from the web mak...

An Automated MR Image Segmentation System Using Multi-layer Perceptron Neural Network

Background: Brain tissue segmentation for delineation of 3D anatomical structures from magnetic resonance (MR) images can be used for neuro-degenerative disorders, characterizing morphological differences between subjects based on volumetric analysis of gray matter (GM), white matter (WM) and cerebrospinal fluid (CSF), but only if the obtained segmentation results are correct. Due to image arti...

Producing a Persian Text Tokenizer Corpus Focusing on Its Computational Linguistics Considerations

The main task of the tokenization is to divide the sentences of the text into its constituent units and remove punctuation marks (dots, commas, etc.). Each unit is a continuous lexical or grammatical writing chain that is an independent semantic unit. Tokenization occurs at the word level and the extracted units can be used as input to other components such as stemmer. The requirement to create...

A Simple Ensemble Method for Hedge Identification

We present in this paper a simple hedge identification method and its application on biomedical text. The problem at hand is a subtask of CoNLL-2010 shared task. Our solution consists of two classifiers, a statistical one and a CRF model, and a simple combination schema that combines their predictions. We report in detail on each component of our system and discuss the results. We also show tha...

Publication date: 2014